Comparative Mining of B2C Web Sites by Discovering Web Database Schemas



Abstract

Discovering potentially useful and previously unknown information or knowledge from heterogeneous web content, such as "list all laptop prices from Walmart and Staples between 2013 and 2015, including make, type, screen size, CPU power, and year of make", requires the difficult tasks of finding the schema of web documents from different web pages, integrating the web content data, and building a virtual or physical data warehouse before web content can be extracted and mined from the database. Wrappers that extract target information from web pages can be manual, semi-supervised, or automatic systems. Automatic systems such as the WebOMiner system use data extraction techniques that parse the web page HTML source code into a document object model (DOM) tree and then traverse the DOM for pattern discovery. Limitations of these existing systems include their reliance on complicated matching techniques, such as tree matching and finite state automata, and their failure to yield accurate results for complex queries such as historical and derived queries.

This paper proposes building WebOMiner S, which applies web structure and content mining approaches to the DOM-tree HTML code to simplify the web data extraction process of the WebOMiner system and make it more easily extendable. The WebOMiner system relies on non-deterministic finite state automata (NFA) to recognize and extract different types of web data (e.g., text, images, links, and lists). The proposed WebOMiner S replaces the NFA of WebOMiner with a frequent structure finder algorithm, which uses regular expression matching in the Java XPath parser and its methods (such as compile() and evaluate()) to dynamically discover the most frequent structure in the DOM tree, that is, the most frequently repeated block in the HTML code, represented as tags of the form <div class=" ">. This approach eliminates the need for any supervised training or for updating the wrapper for each new B2C web page, making the approach simpler, more easily extendable, and automated.
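The idea of a frequent structure finder can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `FrequentStructureFinder`, the method `mostFrequentDivClass`, and the sample page are hypothetical, the input is assumed to be well-formed XHTML, and the sketch simply counts the `class` attribute values of `div` elements instead of applying the regular expression matching the abstract mentions. It only shows how the standard `compile()` and `evaluate()` calls of `javax.xml.xpath` might be used to find the most frequently repeated block.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration; names and approach are assumptions, not the paper's code.
public class FrequentStructureFinder {

    /**
     * Returns the class attribute value carried by the largest number of
     * div elements, or null if the page has no div with a class attribute.
     * Assumes the input is well-formed XHTML (real pages usually need cleaning first).
     */
    public static String mostFrequentDivClass(String xhtml) {
        try {
            // Parse the page source into a DOM tree.
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Compile an XPath expression that selects every div with a class attribute.
            XPath xpath = XPathFactory.newInstance().newXPath();
            XPathExpression expr = xpath.compile("//div[@class]");
            NodeList divs = (NodeList) expr.evaluate(dom, XPathConstants.NODESET);

            // Count how often each class value occurs and track the most frequent one.
            Map<String, Integer> counts = new HashMap<>();
            String best = null;
            int bestCount = 0;
            for (int i = 0; i < divs.getLength(); i++) {
                String cls = ((Element) divs.item(i)).getAttribute("class");
                int count = counts.merge(cls, 1, Integer::sum);
                if (count > bestCount) {
                    bestCount = count;
                    best = cls;
                }
            }
            return best;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the page", e);
        }
    }

    public static void main(String[] args) {
        // Made-up product listing page used only for demonstration.
        String page = "<html><body>"
                + "<div class=\"nav\">Home</div>"
                + "<div class=\"product\">Laptop A</div>"
                + "<div class=\"product\">Laptop B</div>"
                + "<div class=\"product\">Laptop C</div>"
                + "<div class=\"footer\">Contact</div>"
                + "</body></html>";
        System.out.println(mostFrequentDivClass(page)); // prints "product"
    }
}
```

In this sketch the repeated `product` block stands in for the product records of a B2C listing page; because the block is found by counting rather than by a hand-written pattern, no per-site wrapper is needed, which is the property the abstract emphasizes.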
